Introduction

In this report, we summarize two datasets: one on geopotential height values in the Northern Hemisphere between 35 degrees N and 70 degrees N, with a time dimension ranging from 1944 to the present, and one detailing flooding events from 1985 to 2016. We present the data through visualizations that show the overall trends, and we use them to gain insight into the relationships among the variables covered in the sections below.

Monthly Flood Timeline

Here we present an interactive timeline of the flooding events from 1985 to 2015. The size of each circle indicates the number of people displaced by the flood. You can pause the timeline and hover over a specific flooding event to see more details about it.

(Pause and hover over the floods to see more information!)

Preliminary Flood Data Analysis

Here we show basic flooding statistics. The following graphs represent the number of flooding events per month for each year, and we can see that some years exhibit increased flooding in the late summer months.
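The tally behind such graphs can be sketched as follows. This is a minimal illustration that assumes each flood record carries a start date; the dates below are made up and are not taken from the dataset.

```python
from collections import Counter
from datetime import date

# Hypothetical start dates; the real records are not reproduced here.
flood_start_dates = [
    date(1998, 7, 3),
    date(1998, 8, 14),
    date(1998, 8, 29),
    date(1999, 1, 9),
]

# Count events per (year, month) pair.
events_per_month = Counter((d.year, d.month) for d in flood_start_dates)

for (year, month), n in sorted(events_per_month.items()):
    print(f"{year}-{month:02d}: {n} flood(s)")
```

Months with no events simply have a count of zero, which a plot would show as an empty bar.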

Geopotential Height Data

We compute the approximate average and median geopotential height (phi) data for New York City for every day for the last 65 years. More specifically, these calculations are performed over the 2.5 degree by 2.5 degree geo-coordinate grid that contains New York City.
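As a rough sketch of this step, the snippet below locates the 2.5 degree cell that contains a coordinate and summarizes the readings for one day. It assumes that cell edges fall on multiples of 2.5 degrees and that several readings per day are available; neither detail is stated in this report. The function names and the sample values are hypothetical.

```python
import math
from statistics import mean, median

CELL_DEG = 2.5  # grid resolution stated in the text


def grid_cell(lat, lon, size=CELL_DEG):
    """South-west corner of the size-degree cell containing (lat, lon).

    Assumes cell edges fall on multiples of `size`; the dataset's
    actual grid alignment may differ.
    """
    return (math.floor(lat / size) * size, math.floor(lon / size) * size)


def daily_summary(readings_by_day):
    """Map each day to the (mean, median) of its readings."""
    return {day: (mean(v), median(v)) for day, v in readings_by_day.items()}


# Approximate coordinates of New York City (degrees north, degrees east).
nyc_cell = grid_cell(40.71, -74.01)

# Hypothetical geopotential heights, for illustration only.
sample = {"day 1": [5520.0, 5535.0, 5550.0, 5610.0]}
summary = daily_summary(sample)
```

The median is less sensitive than the mean to a single extreme reading, which is one reason to report both.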

Flood Data

Affected sq. km

The following diagrams are lattice plots for the average area in square kilometers affected by floods per month and per year. This information is broken up into two full decades: the 1990s and the 2000s.

ANOVA Analysis

We examine whether there are statistically significant differences in the area affected by floods among months and years, as well as between the last two decades. To do so, we perform several ANOVA (Analysis of Variance) tests, all at a significance level of 0.05. All of these tests assume that the areas are drawn from a Gaussian distribution with an unknown but fixed variance.

The first is a one-way ANOVA test for the null hypothesis that the average area affected is the same between the 1990s and the 2000s. This test results in a p-value of 0.00285, so we can reject the null hypothesis.

##               Df    Sum Sq   Mean Sq F value  Pr(>F)   
## DECADE         1 4.310e+11 4.310e+11   8.914 0.00285 **
## Residuals   3196 1.545e+14 4.834e+10                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
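To make the F value in such a table concrete, here is a minimal from-scratch sketch of the one-way decomposition into between-group and within-group sums of squares. The function name and the two small groups are hypothetical, not the flood data. The p-value would come from the F distribution with the two degrees of freedom and is not computed here.

```python
def one_way_anova(groups):
    """Return (df_between, df_within, F) for a list of numeric groups."""
    all_values = [x for g in groups for x in g]
    grand_mean = sum(all_values) / len(all_values)
    group_means = [sum(g) / len(g) for g in groups]

    # Variation of the group means around the grand mean.
    ss_between = sum(
        len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means)
    )
    # Variation of the observations around their own group mean.
    ss_within = sum(
        (x - m) ** 2 for g, m in zip(groups, group_means) for x in g
    )

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return df_between, df_within, f_stat


# Hypothetical affected areas (sq. km) for two decades; not real data.
area_1990s = [10.0, 12.0, 14.0]
area_2000s = [20.0, 22.0, 24.0]
df_b, df_w, f_stat = one_way_anova([area_1990s, area_2000s])
```

A large F means the group means differ by more than the spread within the groups would explain.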

The second is a two-way ANOVA test for the null hypothesis that the average area affected is the same for all months and all years between 1990 and 2009. This test yields a p-value of 0.00329 for the month and 1.68e-05 for the year, but 0.37483 for the interaction between month and year. The results imply that there is a statistically significant difference in average area affected per month, as well as per year. These results make sense considering the effect of seasonality on the weather: some months of the year correspond to the rainy season, and different years have different global weather patterns (e.g., El Niño and La Niña). However, the interaction between the two factors is not statistically significant, so we find no evidence that the month-to-month pattern differs from one year to the next. Thus, even though there are differences between months and between years, the seasonal pattern of flood impact area appears to have changed little within the past two full decades.

##               Df    Sum Sq   Mean Sq F value   Pr(>F)    
## MONTH         11 1.334e+12 1.212e+11   2.549  0.00329 ** 
## YEAR          19 2.677e+12 1.409e+11   2.962 1.68e-05 ***
## MONTH:YEAR   209 1.024e+13 4.897e+10   1.030  0.37483    
## Residuals   2958 1.407e+14 4.756e+10                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
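As a worked check, the F values can be recomputed from the Df and Sum Sq columns of the table above: each mean square is Sum Sq divided by Df, and each F value is that mean square divided by the residual mean square. The snippet is only a sketch of that arithmetic. Because the printed inputs are rounded, the results agree with the table only to about three decimal places.

```python
# (Df, Sum Sq) pairs copied from the two-way table above.
table = {
    "MONTH": (11, 1.334e12),
    "YEAR": (19, 2.677e12),
    "MONTH:YEAR": (209, 1.024e13),
}
resid_df, resid_ss = 2958, 1.407e14

resid_ms = resid_ss / resid_df  # residual mean square
f_values = {term: (ss / df) / resid_ms for term, (df, ss) in table.items()}

for term, f in f_values.items():
    print(f"{term:<11} F = {f:.3f}")
```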

The two ANOVA tests may appear to contradict each other, but they answer different questions: the one-way test shows a per-decade difference when monthly and yearly variation is ignored, whereas the two-way test takes that variation into account.

Monetary Damages

The next set of diagrams are lattice plots for the average amount of damage in USD (United States dollars) due to floods per month and per year. This information is also split by the last two calendar decades.

ANOVA Analysis

We examine whether there are statistically significant differences in flood damage among months and years, as well as between the last two decades. We run two ANOVA tests at a significance level of 0.05. As before, all of these tests assume that the damage values are drawn from a Gaussian distribution with an unknown but fixed variance.

The first is a one-way ANOVA test for the null hypothesis that the average value of flood damages is the same between the 1990s and the 2000s. This test results in a p-value of 0.0214, so we can reject the null hypothesis.

##               Df    Sum Sq   Mean Sq F value Pr(>F)  
## DECADE         1 2.578e+20 2.578e+20   5.308 0.0214 *
## Residuals   1151 5.591e+22 4.857e+19                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 2045 observations deleted due to missingness

The second is a two-way ANOVA test for the null hypothesis that the average value of damages is the same for all months and all years between 1990 and 2009. This test yields a p-value of 0.578 for the month, 0.449 for the year, and 1.0 for the interaction between month and year, so we fail to reject the null hypothesis for any of the three terms. One possible reason why the test suggests little change is that whenever floods happen, especially in vulnerable communities, they may cause similar levels of damage, regardless of when they occur.

##              Df    Sum Sq   Mean Sq F value Pr(>F)
## MONTH        11 5.019e+20 4.563e+19   0.862  0.578
## YEAR         19 1.013e+21 5.333e+19   1.007  0.449
## MONTH:YEAR  187 5.149e+21 2.753e+19   0.520  1.000
## Residuals   935 4.950e+22 5.294e+19               
## 2045 observations deleted due to missingness
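The degrees of freedom in the four tables can be reconciled with the missingness note. Assuming each fit includes an intercept, the number of observations used equals the sum of all Df entries plus one. The helper name below is hypothetical; the numbers are copied from the tables in this report.

```python
def n_observations(dfs):
    """Observations used by a fit: all degrees of freedom plus the intercept."""
    return sum(dfs) + 1


area_one_way = n_observations([1, 3196])
area_two_way = n_observations([11, 19, 209, 2958])
damage_one_way = n_observations([1, 1151])
damage_two_way = n_observations([11, 19, 187, 935])

# Records present in the area fits but dropped from the damage fits.
dropped = area_one_way - damage_one_way

print(area_one_way, damage_one_way, dropped)
```

The difference matches the 2045 observations reported as deleted. The interaction term for damages has 187 degrees of freedom rather than 11 x 19 = 209, which would be consistent with some month-year combinations having no recorded damage values.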

Trend Analysis

The lattice plots provided a compact view of how the flood damage and impact area changed on average per month and per year. The ANOVA tests helped determine that there were statistically significant changes over time. However, we need additional visualizations to gain insight into what those changes look like.

The following bar plot shows the average amount of flood damages per year. Some of the years with the highest damage values coincide with high-profile storms and weather patterns: 1991 had “the Perfect Storm” in New England, 1998 had El Niño, and Hurricane Katrina devastated New Orleans in 2005. Other than these extreme values, there is no observable trend between 1990 and 2009, which is consistent with the results of the two-way ANOVA test. It is also worth noting that there is no data for flood damages after 2010. Taken at face value, this would suggest that no floods caused serious monetary damage after that time. However, the 2011 tsunami in Japan and Hurricane Sandy, among other events, render this notion highly unlikely; the gap more plausibly reflects missing records.

The next plot displays the average area impacted by floods over time. There is fluctuation over each decade, with less area affected towards the beginning of a decade, and an increasing area with each year until the end of the decade. This visualization supports the explanation of seasonality for the ANOVA analysis results.

Keywords in News Reporting

We are curious about the words that journalists use when reporting on floods. Specifically, we want to see whether the words change as floods differ in severity. Here we categorize floods by the number of deaths each one causes.

We perform a TF-IDF (Term Frequency, Inverse Document Frequency) step on the news reports for each category of flood. Each corpus of documents consists of the news reports on floods causing a number of deaths in a given interval (e.g., 10-50). After stemming, stopword removal, and number removal, the TF-IDF step extracts the 40 highest-scoring keywords for each corpus.

Then we record the weight of each keyword in each category. For example, if ‘killed’ is a top keyword in 3% of the news reports on floods causing deaths in the (50, 100) interval, we give it a weight of 3.

We keep these weights in a 43x7 matrix, where 43 is the number of high-density keywords and 7 is the number of death-count intervals.
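The weighting can be sketched as follows for a single death-count category. This is one possible reading of the procedure, not necessarily the exact pipeline used: it assumes TF is the token's share of the document, IDF is the natural log of (number of documents / documents containing the token), and a keyword's weight is the percentage of documents in which it ranks among the top k. The function names and the tiny pre-stemmed corpus are hypothetical.

```python
import math
from collections import Counter


def tfidf_top_keywords(docs, k):
    """Return, per document, the k tokens with the highest TF-IDF score."""
    n_docs = len(docs)
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    tops = []
    for doc in docs:
        counts = Counter(doc)
        scores = {
            tok: (cnt / len(doc)) * math.log(n_docs / doc_freq[tok])
            for tok, cnt in counts.items()
        }
        # Highest score first; ties broken alphabetically.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        tops.append(ranked[:k])
    return tops


def keyword_weights(docs, k):
    """Percent of documents in which each token is a top-k keyword."""
    tops = tfidf_top_keywords(docs, k)
    hits = Counter(tok for top in tops for tok in top)
    return {tok: 100.0 * n / len(docs) for tok, n in hits.items()}


# Hypothetical, already stemmed and stopword-filtered reports.
docs = [
    ["flood", "kill", "kill", "kill", "villag"],
    ["flood", "evacu", "river"],
    ["flood", "kill", "bridg"],
    ["flood", "evacu", "evacu", "evacu", "dam"],
]
weights = keyword_weights(docs, k=2)
```

Note that ‘flood’ appears in every document, so its IDF is zero and it never ranks as a top keyword in this toy corpus. One such dictionary per category, aligned on a shared keyword list, would give the columns of the weight matrix.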

Here is a word cloud that we generate for floods causing between 50 and 100 deaths. Keywords such as ‘caused’, ‘desperation’, and ‘sept’ appear large, indicating that they occur frequently in news reporting on floods of this scale.

Interestingly, comparing the above word cloud against the one for news reporting on floods causing between 100 and 500 deaths, we immediately see a difference. Now words such as ‘feb’, ‘evacuated’, and ‘inundated’ become denser.

Finally, we show an interactive heatmap to explore how the density of keywords in news reporting changes as the severity of floods increases.

The heatmap applies dendrogram clustering to the words. As one may expect, ‘flood’ and ‘flooded’ sit near each other in the clustering, while ‘feb’ and ‘sept’ are far apart. Perhaps more interestingly, we observe that a keyword such as ‘abandon’ is densest in reports on floods with deaths <=50.

Causes of Floods

In this section we look at the causes of the floods. We pick the six most common causes and use string matching to categorize the floods.
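A minimal sketch of such string matching is shown below. It uses only the four causes that this report names (rain, monsoon, tropical storm, snow); the remaining categories are not listed here, so the keyword list is incomplete, and the function name and sample descriptions are hypothetical.

```python
# Causes named in this report; the full list of six is not given.
# Order matters: "monsoon" is checked before "rain" so that
# "monsoonal rain" is not labeled as plain rain.
CAUSE_KEYWORDS = ["monsoon", "tropical storm", "snow", "rain"]


def classify_cause(description):
    """Return the first cause keyword found in a free-text description."""
    text = description.lower()
    for keyword in CAUSE_KEYWORDS:
        if keyword in text:
            return keyword
    return "other"


for sample in ["Monsoonal Rain", "Heavy rain", "Snowmelt", "Dam break"]:
    print(f"{sample!r} -> {classify_cause(sample)}")
```

Substring matching is simple but order-sensitive, so more specific causes should be listed before more general ones.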

We project the floods of the past 6 years (2010-2015) onto the map. The area of each dot indicates the affected area of the flood. We see some easily recognizable patterns: the majority of floods are caused by rain (green dots); monsoon-caused floods (brown dots) are dominant in South Asia and East China. Tropical storms are, as the name suggests, common in tropical areas, and so are the floods caused by them. Floods caused by snow are rare and occur mainly in high-latitude regions such as Russia and Kazakhstan.

Principal Component Analysis on Countries

In this section, we perform principal component analysis (PCA) on countries that experienced floods in a given year. We choose 7 variables from the GlobalFloodArchive data, namely Affected Area, Severity, Magnitude, Dead, Displaced, Latitude and Longitude.

We use PCA dimensionality reduction to project the 7 chosen dimensions onto 2. We would also like to investigate whether these countries form clusters. We use a color coding from green to red that indicates the ratio of deaths to the severity of the flood. The color thus indicates how damaging the flood is: the closer to red, the more damaging.
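The projection can be illustrated with a small dependency-free sketch. The report does not say how the PCA was computed; the version below standardizes each variable and then finds the two leading eigenvectors of the covariance matrix by power iteration with deflation, which is a swapped-in technique chosen for brevity. The function names and the three-variable rows are made up for illustration.

```python
import math


def standardize(rows):
    """Center each column and scale it to unit sample variance."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    stds = [
        math.sqrt(sum((x - m) ** 2 for x in c) / (len(c) - 1))
        for c, m in zip(cols, means)
    ]
    return [[(x - m) / s for x, m, s in zip(r, means, stds)] for r in rows]


def covariance(rows):
    """Sample covariance matrix of already-centered rows."""
    n, d = len(rows), len(rows[0])
    return [
        [sum(r[i] * r[j] for r in rows) / (n - 1) for j in range(d)]
        for i in range(d)
    ]


def mat_vec(mat, vec):
    return [sum(a * b for a, b in zip(row, vec)) for row in mat]


def dominant_eigenpair(mat, iters=500):
    """Power iteration: repeatedly apply the matrix and renormalize."""
    vec = [float(i + 1) for i in range(len(mat))]
    for _ in range(iters):
        vec = mat_vec(mat, vec)
        norm = math.sqrt(sum(x * x for x in vec))
        vec = [x / norm for x in vec]
    value = sum(a * b for a, b in zip(vec, mat_vec(mat, vec)))
    return value, vec


def pca_2d(rows):
    """Project rows onto their first two principal components."""
    z = standardize(rows)
    cov = covariance(z)
    components = []
    for _ in range(2):
        value, vec = dominant_eigenpair(cov)
        components.append(vec)
        # Deflation: subtract the component just found.
        d = len(vec)
        cov = [
            [cov[i][j] - value * vec[i] * vec[j] for j in range(d)]
            for i in range(d)
        ]
    return [[sum(a * b for a, b in zip(r, c)) for c in components] for r in z]


# Hypothetical rows (three made-up variables per flood), not real data.
floods = [
    [1.0, 2.0, 0.5],
    [2.0, 4.1, 0.1],
    [3.0, 6.2, 0.9],
    [4.0, 7.9, 0.3],
    [5.0, 10.1, 0.7],
]
scores = pca_2d(floods)  # one (PC1, PC2) pair per flood
```

Standardizing first keeps variables with large units, such as affected area, from dominating the components. Each resulting pair can be plotted as a point and colored by the deaths-to-severity ratio.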

In this example we focus on floods in 2000:

We observe some clustering, such as the 4 floods in Brazil in the top part and the 2 floods in India in the bottom. The 2 PCs may be interpreted such that PC2 corresponds to how damaging a flood is, as most country-flood pairs in the upper half are red while the ones in the bottom are green.

However, we notice that there is no clear separation on this plot between developed and developing countries. The USA and the Philippines are close to each other in the center, and so are France and India on the left. Perhaps this suggests that flooding is a global disaster regardless of a country’s GDP or geographic location.

Flood Death

The following plot shows deaths related to flooding events between 1985 and 2015.

Flood Monthly Magnitude

In the following maps, we plot the flooding events of each month. We find that flooding may be related to temperature: more flooding events have occurred between May and September.

Conclusion

From our ANOVA analysis we conclude that there is a statistically significant difference between the value of the damage caused by flooding in the 1990s and in the 2000s. We can also see from the graphs that flood events occur around rivers and coastal regions, and when coupled with trellis plots of flooding events across months by year, there appear to be time periods of increased flooding that correspond with months in late summer.